Introduction

Problems/Questions To Be Answered

  • What shoes are the most popular?
  • Which shoes are the most/least profitable?
  • Does region or size affect the profit?

Reading the Data

Data Wrangling and Cleaning

library(readr)
sneaker_data <- read_csv("StockX-Data-Contest-2019-3.csv")
## Rows: 99956 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Order Date, Brand, Sneaker Name, Sale Price, Retail Price, Release ...
## dbl (1): Shoe Size
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sneaker_data)
## # A tibble: 6 × 8
##   `Order Date` Brand `Sneaker Name`      Sale …¹ Retai…² Relea…³ Shoe …⁴ Buyer…⁵
##   <chr>        <chr> <chr>               <chr>   <chr>   <chr>     <dbl> <chr>  
## 1 9/1/2017     Yeezy Adidas-Yeezy-Boost… $1,097  $220    9/24/2…    11   Califo…
## 2 9/1/2017     Yeezy Adidas-Yeezy-Boost… $685    $220    11/23/…    11   Califo…
## 3 9/1/2017     Yeezy Adidas-Yeezy-Boost… $690    $220    11/23/…    11   Califo…
## 4 9/1/2017     Yeezy Adidas-Yeezy-Boost… $1,075  $220    11/23/…    11.5 Kentuc…
## 5 9/1/2017     Yeezy Adidas-Yeezy-Boost… $828    $220    2/11/2…    11   Rhode …
## 6 9/1/2017     Yeezy Adidas-Yeezy-Boost… $798    $220    2/11/2…     8.5 Michig…
## # … with abbreviated variable names ¹​`Sale Price`, ²​`Retail Price`,
## #   ³​`Release Date`, ⁴​`Shoe Size`, ⁵​`Buyer Region`

From this we can see a few formatting issues. The column types are wrong: the price variables should be numeric and the date variables should be dates, but both were read in as text. Additionally, I will rename the columns to make them easier to work with.

colnames(sneaker_data) <- c('Order_Date',
                            'Brand',
                            'Shoe_Name',
                            'Resale_Price',
                            'Retail_Price',
                            'Release_Date',
                            'Shoe_Size',
                            'Buy_Region')

library(dplyr)

sneaker_data2 <- sneaker_data %>%
  mutate(
    Order_Date   = as.Date(Order_Date, format = "%m/%d/%Y"),   # text to Date
    Resale_Price = parse_number(Resale_Price),                  # drops "$" and ","
    Retail_Price = parse_number(Retail_Price),
    Release_Date = as.Date(Release_Date, format = "%m/%d/%Y"),
    Shoe_Size    = as.numeric(Shoe_Size)                        # already numeric; kept as a safeguard
  )
head(sneaker_data2)
## # A tibble: 6 × 8
##   Order_Date Brand Shoe_Name          Resal…¹ Retai…² Release_…³ Shoe_…⁴ Buy_R…⁵
##   <date>     <chr> <chr>                <dbl>   <dbl> <date>       <dbl> <chr>  
## 1 2017-09-01 Yeezy Adidas-Yeezy-Boos…    1097     220 2016-09-24    11   Califo…
## 2 2017-09-01 Yeezy Adidas-Yeezy-Boos…     685     220 2016-11-23    11   Califo…
## 3 2017-09-01 Yeezy Adidas-Yeezy-Boos…     690     220 2016-11-23    11   Califo…
## 4 2017-09-01 Yeezy Adidas-Yeezy-Boos…    1075     220 2016-11-23    11.5 Kentuc…
## 5 2017-09-01 Yeezy Adidas-Yeezy-Boos…     828     220 2017-02-11    11   Rhode …
## 6 2017-09-01 Yeezy Adidas-Yeezy-Boos…     798     220 2017-02-11     8.5 Michig…
## # … with abbreviated variable names ¹​Resale_Price, ²​Retail_Price,
## #   ³​Release_Date, ⁴​Shoe_Size, ⁵​Buy_Region
sum(is.na(sneaker_data2))
## [1] 0

The column types are now all correct, and there are no missing values in the data set. With the data cleaned, we can begin exploring it.
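A single total from sum(is.na(...)) does not say which column a missing value came from. As a minimal sketch, assuming a small made-up data frame (toy_check and its values are invented for illustration and are not part of the StockX data), colSums() gives the count per column. It also shows that a date string that does not match the format is silently converted to NA, which is why the check is worth running after the conversions.

```r
# Toy stand-in for the cleaned data; values are invented for illustration.
toy_check <- data.frame(
  Order_Date   = as.Date(c("9/1/2017", "not a date", "9/3/2017"),
                         format = "%m/%d/%Y"),  # the bad string becomes NA
  Resale_Price = c(1097, 685, NA)
)

# Number of missing values in each column, rather than one overall total.
na_by_column <- colSums(is.na(toy_check))
na_by_column
```

On the real data the equivalent call would be colSums(is.na(sneaker_data2)).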

Data Exploration

Average Resale Price of Each Shoe By Brand

## `summarise()` has grouped output by 'Shoe_Name'. You can override using the
## `.groups` argument.

It is interesting to see how much aftermarket value some sneakers have. Generally speaking, most of the Nike Off-White shoes resell for double their retail value.
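The code that produced the averages is not shown in this report. The sketch below is one way such a table could be built; it is an assumption, not the original code. The toy_sales data and the Resale_Multiple column are hypothetical, and the column names simply follow those used elsewhere in this report. Passing .groups = "drop" also silences the summarise() message shown above.

```r
library(dplyr)

# Toy sales records; names and prices are invented for illustration.
toy_sales <- tibble(
  Brand        = c("Yeezy", "Yeezy", "Off-White", "Off-White"),
  Shoe_Name    = c("Shoe-A", "Shoe-A", "Shoe-B", "Shoe-B"),
  Retail_Price = c(220, 220, 190, 190),
  Resale_Price = c(300, 340, 900, 1100)
)

# One row per shoe: average resale price and how many times retail it sells for.
avg_resale_sketch <- toy_sales %>%
  group_by(Brand, Shoe_Name, Retail_Price) %>%
  summarise(Average_Resale_Price = mean(Resale_Price), .groups = "drop") %>%
  mutate(Resale_Multiple = Average_Resale_Price / Retail_Price)

avg_resale_sketch
```

A Resale_Multiple of 2 or more corresponds to a shoe that resells for at least double its retail price.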

Most Profitable Shoe

profitOfShoe <- avgResaleBySneaker %>%
  summarize(Brand, Shoe_Name, Retail_Price, Average_Resale_Price,
            Average_Profit = Average_Resale_Price - Retail_Price) %>%
  unique()
## `summarise()` has grouped output by 'Shoe_Name'. You can override using the
## `.groups` argument.
head(profitOfShoe %>% arrange(-Average_Profit),1)
## # A tibble: 1 × 5
## # Groups:   Shoe_Name [1]
##   Shoe_Name                               Brand     Retail_Price Avera…¹ Avera…²
##   <chr>                                   <chr>            <dbl>   <dbl>   <dbl>
## 1 Air-Jordan-1-Retro-High-Off-White-White Off-White          190   1826.   1636.
## # … with abbreviated variable names ¹​Average_Resale_Price, ²​Average_Profit

The Nike Off-White Air Jordan 1 in the white colorway has the highest average profit, about $1,636 over its $190 retail price.

Least Profitable Shoe

head(profitOfShoe %>% arrange(+Average_Profit),1)
## # A tibble: 1 × 5
## # Groups:   Shoe_Name [1]
##   Shoe_Name                        Brand Retail_Price Average_Resale_P…¹ Avera…²
##   <chr>                            <chr>        <dbl>              <dbl>   <dbl>
## 1 Adidas-Yeezy-Boost-350-V2-Sesame Yeezy          220               264.    44.1
## # … with abbreviated variable names ¹​Average_Resale_Price, ²​Average_Profit

The Adidas Yeezy Boost 350 V2 in the Sesame colorway has the lowest average profit, about $44 over its $220 retail price.
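As an alternative to sorting and taking the first row, dplyr's slice_max() and slice_min() pick out the extreme rows directly. This is a minimal sketch on made-up numbers (toy_profit is hypothetical and does not reproduce the figures above).

```r
library(dplyr)

# Toy per-shoe averages; values are invented for illustration.
toy_profit <- tibble(
  Shoe_Name            = c("Shoe-A", "Shoe-B", "Shoe-C"),
  Retail_Price         = c(220, 190, 160),
  Average_Resale_Price = c(250, 1800, 500)
) %>%
  mutate(Average_Profit = Average_Resale_Price - Retail_Price)

# Row with the largest and the smallest average profit.
most_profitable  <- slice_max(toy_profit, Average_Profit, n = 1)
least_profitable <- slice_min(toy_profit, Average_Profit, n = 1)

most_profitable
least_profitable
```

Both functions return every tied row by default, so a tie for first place is not silently dropped the way it would be with head(..., 1).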

Yeezy Sneaker Average Profit and Retail Interactive Bar Graph

Hover over a color to see the resale value, name, and retail cost of an Adidas Yeezy Sneaker.

The y-axis displays the cumulative profit for all sneakers in the category, but hovering over a sneaker's color on the graph shows the average profit of that shoe.

Interactive Plot of Nike Off-White Sneaker Profit

Hover over a color to see the resale value, name, and retail cost of a Nike Off-White sneaker.

The y-axis displays the cumulative profit for all sneakers in the category, but hovering over a sneaker's color on the graph shows the average profit of that shoe.

What Factors Can Affect Profit?

Now that we have identified the most and least profitable shoes, the next step is to explore which factors, if any, affect profitability.

Shoe Size?

Let’s see if there is a relationship between shoe size and profit by running a simple linear regression on shoe size and average profit.

Visualization of Average Profit and Shoe Size

## `summarise()` has grouped output by 'Shoe_Size'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

From this graph we can see that average profit rises slightly at the larger shoe sizes.

Simple Linear Regression of Shoe Size and Average Profit

sizeModel <- lm(Avg_Profit ~ Shoe_Size, data = sneakerSizePlotData)
sizeModel
## 
## Call:
## lm(formula = Avg_Profit ~ Shoe_Size, data = sneakerSizePlotData)
## 
## Coefficients:
## (Intercept)    Shoe_Size  
##     147.673        9.669
ggplot(sneakerSizePlotData, aes(x = Shoe_Size, y= Avg_Profit)) +  geom_point() + stat_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

Displayed above is the model that was computed and the graph with a regression line. We can now summarize and determine if there is a relationship.

summary(sizeModel)
## 
## Call:
## lm(formula = Avg_Profit ~ Shoe_Size, data = sneakerSizePlotData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -105.68  -13.37    0.73   16.38  664.46 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 147.67282    0.30536   483.6   <2e-16 ***
## Shoe_Size     9.66894    0.03171   304.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.35 on 99954 degrees of freedom
## Multiple R-squared:  0.4819, Adjusted R-squared:  0.4819 
## F-statistic: 9.298e+04 on 1 and 99954 DF,  p-value: < 2.2e-16

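Based on this, the model does not support the idea that shoe size is unrelated to profit. The slope is positive (about $9.67 of average profit per additional shoe size), its p-value is below 2e-16, and R-squared is about 0.48. These figures should be read with caution, though: the model was fit to all 99,956 rows (99,954 residual degrees of freedom) with an averaged profit as the response, so the standard errors are likely understated, and the R-squared describes variation in average profit, not in the profit of individual sales.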
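How the response is constructed changes what the regression output means. The sketch below uses simulated data only (the slope, noise level, and sample size are arbitrary assumptions, and none of the names refer to objects in this report) to contrast a fit on individual sales with a fit on one averaged row per shoe size.

```r
set.seed(1)

# Simulated sales: profit rises with size, plus large sale-to-sale noise.
n <- 500
toy_sizes <- data.frame(
  Shoe_Size = sample(seq(4, 15, by = 0.5), n, replace = TRUE)
)
toy_sizes$Profit <- 150 + 10 * toy_sizes$Shoe_Size + rnorm(n, sd = 200)

# Model 1: one row per sale. Degrees of freedom reflect the number of sales.
per_sale_model <- lm(Profit ~ Shoe_Size, data = toy_sizes)
coef(summary(per_sale_model))
summary(per_sale_model)$r.squared

# Model 2: one row per shoe size, using the average profit at that size.
toy_size_avg <- aggregate(Profit ~ Shoe_Size, data = toy_sizes, FUN = mean)
avg_model <- lm(Profit ~ Shoe_Size, data = toy_size_avg)
coef(summary(avg_model))
summary(avg_model)$r.squared
```

Averaging removes most of the sale-to-sale noise, so the second model usually reports a much higher R-squared even though the underlying relationship is the same. Comparing the two outputs makes clear which question each model answers.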

Region?

Now that we have examined the effect of shoe size on average profit, let’s take a look at buyer region. Following a similar procedure, we can check whether region affects the profit of a shoe by fitting a linear regression with buyer region as a categorical predictor.

Visualization of Average Profit by Buyer Region

## `summarise()` has grouped output by 'Buy_Region'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

No clear pattern across regions stands out in the plot.

Simple Linear Regression of Average Profit and Buyer Region

regionModel <- lm(Avg_Profit ~ Buy_Region, data = sneakerRegionPlotData)
regionModel
## 
## Call:
## lm(formula = Avg_Profit ~ Buy_Region, data = sneakerRegionPlotData)
## 
## Coefficients:
##                    (Intercept)                Buy_RegionAlaska  
##                        183.888                          44.636  
##              Buy_RegionArizona              Buy_RegionArkansas  
##                         56.474                          12.643  
##           Buy_RegionCalifornia              Buy_RegionColorado  
##                         87.538                          40.651  
##          Buy_RegionConnecticut              Buy_RegionDelaware  
##                         17.744                         113.763  
## Buy_RegionDistrict of Columbia               Buy_RegionFlorida  
##                         60.115                          53.984  
##              Buy_RegionGeorgia                Buy_RegionHawaii  
##                         37.465                         100.561  
##                Buy_RegionIdaho              Buy_RegionIllinois  
##                        -10.524                          35.857  
##              Buy_RegionIndiana                  Buy_RegionIowa  
##                         21.158                          69.955  
##               Buy_RegionKansas              Buy_RegionKentucky  
##                         14.667                          65.855  
##            Buy_RegionLouisiana                 Buy_RegionMaine  
##                         12.868                         -27.434  
##             Buy_RegionMaryland         Buy_RegionMassachusetts  
##                         40.289                          35.760  
##             Buy_RegionMichigan             Buy_RegionMinnesota  
##                         22.145                          44.706  
##          Buy_RegionMississippi              Buy_RegionMissouri  
##                         -1.453                          25.871  
##              Buy_RegionMontana              Buy_RegionNebraska  
##                         15.601                          13.103  
##               Buy_RegionNevada         Buy_RegionNew Hampshire  
##                         94.923                          31.946  
##           Buy_RegionNew Jersey            Buy_RegionNew Mexico  
##                         56.419                          29.722  
##             Buy_RegionNew York        Buy_RegionNorth Carolina  
##                         49.741                          25.012  
##         Buy_RegionNorth Dakota                  Buy_RegionOhio  
##                         28.955                          34.781  
##             Buy_RegionOklahoma                Buy_RegionOregon  
##                         42.313                          79.521  
##         Buy_RegionPennsylvania          Buy_RegionRhode Island  
##                         26.475                          23.189  
##       Buy_RegionSouth Carolina          Buy_RegionSouth Dakota  
##                         33.013                          -6.779  
##            Buy_RegionTennessee                 Buy_RegionTexas  
##                         23.830                          23.128  
##                 Buy_RegionUtah               Buy_RegionVermont  
##                         56.187                          67.457  
##             Buy_RegionVirginia            Buy_RegionWashington  
##                         58.204                          51.480  
##        Buy_RegionWest Virginia             Buy_RegionWisconsin  
##                        -28.111                          40.476  
##              Buy_RegionWyoming  
##                        -59.363
ggplot(sneakerRegionPlotData, aes(x = Buy_Region, y= Avg_Profit)) +  geom_point() + stat_smooth(method = lm) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
## `geom_smooth()` using formula = 'y ~ x'

No regression line is drawn because Buy_Region is categorical, so there is no numeric x-axis along which to fit one. We can still analyze the summary of the model.

summary(regionModel)
## 
## Call:
## lm(formula = Avg_Profit ~ Buy_Region, data = sneakerRegionPlotData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.089e-08  0.000e+00  0.000e+00  0.000e+00  1.109e-07 
## 
## Coefficients:
##                                  Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)                     1.839e+02  1.686e-11  1.090e+13   <2e-16 ***
## Buy_RegionAlaska                4.464e+01  4.914e-11  9.083e+11   <2e-16 ***
## Buy_RegionArizona               5.647e+01  1.942e-11  2.907e+12   <2e-16 ***
## Buy_RegionArkansas              1.264e+01  3.218e-11  3.929e+11   <2e-16 ***
## Buy_RegionCalifornia            8.754e+01  1.706e-11  5.131e+12   <2e-16 ***
## Buy_RegionColorado              4.065e+01  2.051e-11  1.982e+12   <2e-16 ***
## Buy_RegionConnecticut           1.774e+01  2.004e-11  8.856e+11   <2e-16 ***
## Buy_RegionDelaware              1.138e+02  1.972e-11  5.768e+12   <2e-16 ***
## Buy_RegionDistrict of Columbia  6.012e+01  2.764e-11  2.175e+12   <2e-16 ***
## Buy_RegionFlorida               5.398e+01  1.746e-11  3.092e+12   <2e-16 ***
## Buy_RegionGeorgia               3.747e+01  1.884e-11  1.989e+12   <2e-16 ***
## Buy_RegionHawaii                1.006e+02  2.497e-11  4.027e+12   <2e-16 ***
## Buy_RegionIdaho                -1.052e+01  3.872e-11 -2.718e+11   <2e-16 ***
## Buy_RegionIllinois              3.586e+01  1.785e-11  2.008e+12   <2e-16 ***
## Buy_RegionIndiana               2.116e+01  2.027e-11  1.044e+12   <2e-16 ***
## Buy_RegionIowa                  6.996e+01  2.381e-11  2.938e+12   <2e-16 ***
## Buy_RegionKansas                1.467e+01  2.582e-11  5.681e+11   <2e-16 ***
## Buy_RegionKentucky              6.586e+01  2.347e-11  2.806e+12   <2e-16 ***
## Buy_RegionLouisiana             1.287e+01  2.294e-11  5.609e+11   <2e-16 ***
## Buy_RegionMaine                -2.743e+01  3.562e-11 -7.702e+11   <2e-16 ***
## Buy_RegionMaryland              4.029e+01  1.881e-11  2.142e+12   <2e-16 ***
## Buy_RegionMassachusetts         3.576e+01  1.814e-11  1.971e+12   <2e-16 ***
## Buy_RegionMichigan              2.214e+01  1.820e-11  1.216e+12   <2e-16 ***
## Buy_RegionMinnesota             4.471e+01  2.153e-11  2.076e+12   <2e-16 ***
## Buy_RegionMississippi          -1.453e+00  3.289e-11 -4.417e+10   <2e-16 ***
## Buy_RegionMissouri              2.587e+01  2.194e-11  1.179e+12   <2e-16 ***
## Buy_RegionMontana               1.560e+01  5.419e-11  2.879e+11   <2e-16 ***
## Buy_RegionNebraska              1.310e+01  2.854e-11  4.591e+11   <2e-16 ***
## Buy_RegionNevada                9.492e+01  2.119e-11  4.480e+12   <2e-16 ***
## Buy_RegionNew Hampshire         3.195e+01  2.870e-11  1.113e+12   <2e-16 ***
## Buy_RegionNew Jersey            5.642e+01  1.766e-11  3.195e+12   <2e-16 ***
## Buy_RegionNew Mexico            2.972e+01  2.910e-11  1.021e+12   <2e-16 ***
## Buy_RegionNew York              4.974e+01  1.709e-11  2.910e+12   <2e-16 ***
## Buy_RegionNorth Carolina        2.501e+01  1.952e-11  1.281e+12   <2e-16 ***
## Buy_RegionNorth Dakota          2.896e+01  4.811e-11  6.018e+11   <2e-16 ***
## Buy_RegionOhio                  3.478e+01  1.879e-11  1.851e+12   <2e-16 ***
## Buy_RegionOklahoma              4.231e+01  2.449e-11  1.728e+12   <2e-16 ***
## Buy_RegionOregon                7.952e+01  1.736e-11  4.581e+12   <2e-16 ***
## Buy_RegionPennsylvania          2.648e+01  1.806e-11  1.466e+12   <2e-16 ***
## Buy_RegionRhode Island          2.319e+01  2.567e-11  9.034e+11   <2e-16 ***
## Buy_RegionSouth Carolina        3.301e+01  2.264e-11  1.458e+12   <2e-16 ***
## Buy_RegionSouth Dakota         -6.779e+00  5.145e-11 -1.318e+11   <2e-16 ***
## Buy_RegionTennessee             2.383e+01  2.150e-11  1.108e+12   <2e-16 ***
## Buy_RegionTexas                 2.313e+01  1.751e-11  1.321e+12   <2e-16 ***
## Buy_RegionUtah                  5.619e+01  2.394e-11  2.347e+12   <2e-16 ***
## Buy_RegionVermont               6.746e+01  4.280e-11  1.576e+12   <2e-16 ***
## Buy_RegionVirginia              5.820e+01  1.864e-11  3.122e+12   <2e-16 ***
## Buy_RegionWashington            5.148e+01  1.882e-11  2.736e+12   <2e-16 ***
## Buy_RegionWest Virginia        -2.811e+01  3.267e-11 -8.605e+11   <2e-16 ***
## Buy_RegionWisconsin             4.048e+01  2.095e-11  1.932e+12   <2e-16 ***
## Buy_RegionWyoming              -5.936e+01  5.944e-11 -9.987e+11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.605e-10 on 99905 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 9.524e+24 on 50 and 99905 DF,  p-value: < 2.2e-16

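This summary cannot be taken at face value. R-squared is 1 and the residuals are effectively zero because Avg_Profit is the per-region average, so it is determined exactly by Buy_Region. Each coefficient is simply that region's average profit minus the average for the baseline region (Alabama, the intercept, about $184). The averages do differ by region, from about $125 in Wyoming to about $298 in Delaware, but this model cannot tell us whether those differences are larger than the variation between individual sales.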
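One way to test whether region matters is a one-way ANOVA on per-sale profit, which compares the differences between regions with the spread of sales within each region. The sketch below runs on simulated data only (three regions, an arbitrary $40 bump for one of them; toy_regions and its columns are hypothetical). It also reproduces the degenerate case, where the response is the regional average repeated on every row.

```r
set.seed(2)

# Simulated sales in three regions; one region gets a higher mean profit.
toy_regions <- data.frame(
  Buy_Region = rep(c("California", "Oregon", "Texas"), each = 50)
)
toy_regions$Profit <- rnorm(150, mean = 250, sd = 100) +
  ifelse(toy_regions$Buy_Region == "California", 40, 0)

# One-way ANOVA on individual sales: between-region vs within-region variation.
region_anova <- aov(Profit ~ Buy_Region, data = toy_regions)
summary(region_anova)

# Degenerate case: the response is each region's mean, repeated on every row.
toy_regions$Avg_Profit <- ave(toy_regions$Profit, toy_regions$Buy_Region)
degenerate_model <- lm(Avg_Profit ~ Buy_Region, data = toy_regions)

# R warns about an essentially perfect fit here; the warning is suppressed
# because the perfect fit is exactly what this line is meant to show.
degenerate_r2 <- suppressWarnings(summary(degenerate_model)$r.squared)
degenerate_r2
```

In the degenerate model the region labels reproduce the response exactly, so R-squared equals 1 regardless of whether regions truly differ. The ANOVA on per-sale profit is the version whose F-test is meaningful.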

Conclusions